Abstract: In some socio-economic surveys, data are collected on sensitive or stigmatizing issues
such as tax evasion, criminal conviction, drug use, etc. In such surveys, direct questioning of
respondents is not of much use and the randomized response technique is used instead. A few
researchers have studied the issue of privacy protection or respondent jeopardy for surveys on
dichotomous populations, where the objective is to estimate the proportion of persons bearing
the sensitive trait. However, not much is yet known about respondent protection when the
variable under study takes discrete numerical values and the objective of the survey is to estimate
the population mean of this variable. In this article we study this issue. We first propose a
randomization device for this situation and give the corresponding estimation procedure. We
next propose a measure of privacy and show that given a certain stipulated level of this privacy
measure, we can determine the parameter of the randomization device so as to maximize the
efficiency of estimation, while guaranteeing the desired level of privacy protection. In particular,
our study also covers the case of polychotomous populations and we can estimate the
proportions of individuals belonging to the different classes. Consequently, results for dichotomous
populations follow as corollaries.

1 Introduction

The randomized response technique is a useful method for collecting data on variables which
are considered sensitive, incriminating or stigmatizing for the respondents. Examples of such
situations are common in socio-economic surveys; for instance, we may need to collect data on
tax evasion, alcohol addiction, illegal drug use, criminal behaviour or past criminal convictions.
In such surveys, direct questions are not useful as the respondents will either refuse to answer
embarrassing questions or, even if they do answer, may give false answers. In a randomized response
model, the respondents use a randomization device to generate a randomized response and the
parameter under study can be estimated from these responses. So, the respondent is not required
to disclose his true response and it is expected that this will lead to better participation in the
survey on sensitive issues.

Warner (1965) introduced the randomized response technique for estimating the proportion 
of persons bearing a sensitive attribute in a dichotomous population. In Warner's model, with
population categories $A$ and $A^c$, a box with two types of cards labeled $A$ and $A^c$ (in proportion
$p : 1 - p$) is used as the randomization device. A respondent draws a card at random and
responds 'yes' or 'no' according to whether or not he belongs to the card type he draws. Since
then, several researchers have extensively contributed to this area, e.g., Kuk (1990), Ljungqvist 
(1993), Mangat (1994), Chua and Tsui (2000), Van den Hout and Van der Heijden (2002), 
Christofides (2005) and many others. For details on the results available on this technique we 
refer to the review paper by Chaudhuri and Mukerjee (1987) and books by Chaudhuri and 
Mukerjee (1988) and Chaudhuri (2011). 

Lanke (1976) and Leysieffer and Warner (1976) initiated the study of efficiency versus privacy
protection in randomized response surveys where the population is divided into two complementary
sensitive groups, $A$ and $A^c$, and the objective is to estimate the proportions of persons
belonging to these groups. They suggested measures of jeopardy based on the 'revealing
probabilities', i.e., the posterior probabilities of a respondent belonging to groups $A$ and $A^c$ given
his randomized response. Since then, this dichotomous case has been widely studied. Loynes
(1976) extended the jeopardy measure of Leysieffer and Warner (1976) to polychotomous
populations. Ljungqvist (1993) gave a unified and utilitarian approach to measures of privacy for
the dichotomous case. Nayak and Adeshiyan (2009) proposed a measure of jeopardy for surveys
from dichotomous populations and developed an approach for comparing the available
randomization procedures. These results are all based on samples drawn by simple random sampling
with replacement.

All the references given above are for sensitive variables which are categorical or qualitative
in nature. However, in randomized response surveys it is quite common to have situations where
the study variable $X$ is quantitative, e.g. in studies on the number of criminal convictions of
a person, the number of induced abortions, the number of months spent in a correction centre,
the amount of undisclosed income, etc. Anderson (1977) studied the case of continuous sensitive
variables and considered the amount of information provided by the randomized responses. For
ensuring more privacy he recommended that the expectation of the conditional variance of $X$
given the randomized response be made as large as possible. However, not much work seems to
have been done in studying the respondent privacy aspect for discrete-valued sensitive variables,
even though surveys are often undertaken on such variables.

To fill this gap, in this article we focus on studying the issue of privacy protection when 
the underlying variable under study is quantitative and discrete. We propose the use of a 
randomization device and give the associated estimation method. Then, we consider two separate 
cases, one where all values of X are sensitive and another where not all values of X are sensitive. 
For each of these cases, we propose a measure for protecting the privacy of the respondents. We 
finally show how one can choose the randomization device parameter in each case, so as to 
guarantee a certain pre-specified level of respondent protection and then maximize the efficiency
of estimating the parameter of interest under this constraint. Our study also covers qualitative 
sensitive variables, i.e., cases where the population is dichotomous or polychotomous, and allows 
us to estimate the proportions of individuals belonging to each category. 

In Section 2 we give some preliminaries. In Sections 3 and 4 we consider the issues of
estimation and privacy protection, respectively. In Section 5 we obtain the randomization device
parameter which allows efficient estimation while assuring the required level of respondent
protection and illustrate with some numerical examples. In the concluding section we show how
our study covers the case of polychotomous variables.

2 Preliminaries 

Consider a population with $N$ individuals labeled $1, \ldots, N$. Let $X$ denote the sensitive variable
of interest. We assume that $X$ takes a finite number of values $x_1, \ldots, x_m$ and, without loss of
generality, we may suppose these $m$ values to be known. For $1 \le i \le m$, let $\pi_i$ be the unknown
population proportion of individuals for whom $X$ equals $x_i$, i.e.,

$$\mathrm{Prob}(X = x_i) = \pi_i, \quad 1 \le i \le m, \quad \text{where } \pi_i \ge 0, \ \sum_{i=1}^{m} \pi_i = 1. \tag{1}$$

The objective of the survey is to estimate the population mean of $X$. For this, we suppose
as usual (cf. Warner (1965), Nayak and Adeshiyan (2009) and others), that a sample of $n$
individuals is drawn from the population by simple random sampling with replacement. As for
the randomization device, since we are interested in the numerical values of $X$, we propose the
use of a device as described below.

Consider a box containing cards of $(m + 1)$ types, the $i$th type of card being marked 'Report
$x_i$ as your response', $1 \le i \le m$, while the $(m + 1)$th type of card is marked: 'Report your true
value of $X$ as your response.' The box has a large number of cards, say $M$, there being $Mp$
cards of type $(m + 1)$ and $M(1 - p)/m$ cards of each of the types $i$, $1 \le i \le m$, where $0 < p < 1$. A sampled
respondent is asked to draw a card at random from the box and then give a truthful response
according to the card drawn by him, without disclosing the label on the card to the investigator.
Thus the true value of $X$ for the respondent is not known. The $n$ responses so received are the
data from this survey.

Let $R$ denote the randomized response variable. Clearly, with this device, the ranges of
$R$ and $X$ match. The efficiency in estimation and respondent protection will depend on the
choice of the value of $p$, which we call the device parameter. The above device is such that with
probability $p$, a respondent will report his true value, while with probability $1 - p$ he will report
any one of the possible values $x_1, \ldots, x_m$ chosen at random, i.e.,

$$\mathrm{Prob}(R = x_i \mid X = x_j) = \frac{1 - p}{m}, \quad 1 \le i \ne j \le m, \tag{2}$$

$$\mathrm{Prob}(R = x_j \mid X = x_j) = p + \frac{1 - p}{m}, \quad 1 \le j \le m. \tag{3}$$
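
The device can be mimicked in software to see how (2) and (3) shape the observed responses. The following is a minimal sketch and not part of the original article: the function names, the example values of $X$ and the example proportions are illustrative assumptions.

```python
import random

def randomized_response(true_value, values, p, rng):
    """Simulate one card draw for a respondent with the given true value.

    With probability p the 'report your true value' card is drawn; otherwise
    one of the m 'report x_i' cards is drawn, each being equally likely.
    """
    if rng.random() < p:
        return true_value
    return rng.choice(values)

def response_distribution(pi, p):
    """Prob(R = x_i) = p * pi_i + (1 - p) / m, combining (1), (2) and (3)."""
    m = len(pi)
    return [p * pi_i + (1 - p) / m for pi_i in pi]

def simulate_survey(values, pi, p, n, seed=0):
    """Sample n respondents with replacement and return their randomized responses."""
    rng = random.Random(seed)
    true_values = rng.choices(values, weights=pi, k=n)
    return [randomized_response(x, values, p, rng) for x in true_values]

if __name__ == "__main__":
    # Hypothetical population: X takes the values 0, 1, 2.
    values, pi, p = [0, 1, 2], [0.5, 0.3, 0.2], 0.4
    responses = simulate_survey(values, pi, p, n=100_000)
    observed = [responses.count(x) / len(responses) for x in values]
    print("observed:", [round(f, 3) for f in observed])
    print("expected:", [round(l, 3) for l in response_distribution(pi, p)])
```

With a large simulated sample, the observed response frequencies should settle close to $p\pi_i + (1 - p)/m$, which is why the true proportions can be recovered without knowing any individual's card.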

3 Estimation of population mean 

The population mean and variance of $X$ are given by

$$\mu_X = \sum_{i=1}^{m} x_i \pi_i \quad \text{and} \quad \sigma_X^2 = \sum_{i=1}^{m} (x_i - \mu_X)^2 \pi_i,$$

respectively. Our objective is to estimate $\mu_X$ from the $n$ randomized responses collected as
described in Section 2. Let $w_i$ be the sample proportion of randomized responses which equal
$x_i$, $1 \le i \le m$. Hence, from (1)-(3),

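$$E(w_i) = \mathrm{Prob}(R = x_i) = p\pi_i + \frac{1 - p}{m} = \lambda_i, \ \text{say}. \tag{4}$$

So, an unbiased estimator of $\pi_i$ will be given by $\hat{\pi}_i = \frac{1}{p}\left(w_i - \frac{1 - p}{m}\right)$, leading to an unbiased
estimator of $\mu_X$ as

$$\hat{\mu}_X = \sum_{i=1}^{m} x_i \hat{\pi}_i = \frac{1}{p} \sum_{i=1}^{m} x_i w_i - \frac{1 - p}{mp} \sum_{i=1}^{m} x_i.$$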

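Then, on simplification using (4), and writing $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} x_i$, the variance of $\hat{\mu}_X$ is given by

$$\begin{aligned}
\mathrm{Var}(\hat{\mu}_X) &= \frac{1}{p^2} \mathrm{Var}\Big(\sum_{i=1}^{m} x_i w_i\Big) = \frac{1}{np^2} \Big\{ \sum_{i=1}^{m} x_i^2 \lambda_i (1 - \lambda_i) - \sum_{i \ne j = 1}^{m} x_i x_j \lambda_i \lambda_j \Big\} \\
&= \frac{1}{np^2} \Big\{ p\sigma_X^2 + \frac{1 - p}{m} \sum_{i=1}^{m} (x_i - \bar{X})^2 + p(1 - p)(\mu_X - \bar{X})^2 \Big\}.
\end{aligned} \tag{5}$$

Our aim is to estimate $\mu_X$ keeping $\mathrm{Var}(\hat{\mu}_X)$ as small as possible. It is clear from the expression
on the right side of (5) that $\mathrm{Var}(\hat{\mu}_X)$ is decreasing in $p$, irrespective of the values of $\pi_1, \ldots, \pi_m$:
after dividing through by $p^2$, the three terms carry the factors $1/p$, $(1 - p)/p^2$ and $(1 - p)/p$, each of
which decreases in $p$. So, this variance may be decreased, or equivalently, the efficiency of estimation
may be increased by increasing $p$, whatever may be the proportions of the $x_i$ values in the population.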
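
As a numerical cross-check of the estimator and its variance, the sketch below computes $\mathrm{Var}(\hat{\mu}_X)$ in two ways: directly as $\mathrm{Var}(R)/(np^2)$ with $R$ distributed as in (4), and through a closed form assumed here for (5), namely $\frac{1}{np^2}\{p\sigma_X^2 + \frac{1-p}{m}\sum_i (x_i - \bar{X})^2 + p(1-p)(\mu_X - \bar{X})^2\}$. The function names and the example population are hypothetical and not taken from the article.

```python
def estimate_mean(responses, values, p):
    """Unbiased estimates of the proportions pi_i and of the mean of X."""
    m, n = len(values), len(responses)
    # w_i: sample proportion of randomized responses equal to x_i
    w = [sum(1 for r in responses if r == x) / n for x in values]
    pi_hat = [(w_i - (1 - p) / m) / p for w_i in w]
    mu_hat = sum(x * ph for x, ph in zip(values, pi_hat))
    return mu_hat, pi_hat

def variance_direct(values, pi, p, n):
    """Var(mu_hat) = Var(R) / (n p^2), where Prob(R = x_i) = lambda_i as in (4)."""
    m = len(values)
    lam = [p * q + (1 - p) / m for q in pi]
    mean_r = sum(x * l for x, l in zip(values, lam))
    mean_r2 = sum(x * x * l for x, l in zip(values, lam))
    return (mean_r2 - mean_r ** 2) / (n * p ** 2)

def variance_closed_form(values, pi, p, n):
    """Closed form assumed for (5)."""
    m = len(values)
    mu = sum(x * q for x, q in zip(values, pi))
    sigma2 = sum((x - mu) ** 2 * q for x, q in zip(values, pi))
    xbar = sum(values) / m
    spread = sum((x - xbar) ** 2 for x in values) / m
    return (p * sigma2 + (1 - p) * spread + p * (1 - p) * (mu - xbar) ** 2) / (n * p ** 2)

if __name__ == "__main__":
    values, pi, n = [0, 1, 2], [0.5, 0.3, 0.2], 100   # hypothetical population
    for p in (0.2, 0.4, 0.8):
        print(p, variance_direct(values, pi, p, n), variance_closed_form(values, pi, p, n))
```

For this hypothetical population the two computations agree, and the variance shrinks as $p$ grows, in line with the monotonicity noted above.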



4 Privacy protection 

To study the respondent privacy aspect for dichotomous populations, Leysieffer and Warner 
(1976) studied the case where both $A$ and $A^c$ are sensitive categories while Lanke (1975) also
considered the case where only A is sensitive and there is no jeopardy in a 'no' answer to the 
sensitive question. For polychotomous populations, Loynes (1976) studied two cases, one where 
all categories are stigmatizing and another where one of the categories is not stigmatizing. In 
line with these, we too consider the privacy issue for two situations, one where all the m values of 
X are stigmatizing and another where not all values of X are stigmatizing. Both these situations 
commonly arise in practice and we require separate privacy protection measures for them. 

For a randomly chosen respondent from the population, the 'true' probability that the value
of $X$ for this respondent equals $x_i$ is given by $\mathrm{Prob}(X = x_i)$. On the other hand, when this
respondent gives a randomized response, say $x_j$, then the probability that the value of $X$ for
this respondent equals $x_i$ is now given by the conditional probability $\mathrm{Prob}(X = x_i \mid R = x_j)$, or
the 'revealing' probability.

4.1 All values of X are stigmatizing 

Suppose all the values $x_1, \ldots, x_m$ are stigmatizing. In this case, a respondent would feel
comfortable in participating in the survey if the perception of his having a value $X = x_i$ is not much
altered after knowing his randomized response, for all $1 \le i \le m$. This would require that his
true and revealing probabilities be sufficiently close. Starting from this basic premise we define

$$\alpha_{ij} = |\mathrm{Prob}(X = x_i \mid R = x_j) - \mathrm{Prob}(X = x_i)| \tag{6}$$

and since each respondent would want $\alpha_{ij}$ to be as small as possible for all $1 \le i, j \le m$, as a
measure of privacy protection we propose the following measure:

$$\alpha = \max_{1 \le i, j \le m} \alpha_{ij}. \tag{7}$$

A randomization device with a privacy protection value $\alpha = \alpha_0$ would guarantee that the
discrepancies between the true and revealing probabilities will be at most $\alpha_0$ for all respondents,
irrespective of their true values. Thus a device which results in a lower value of $\alpha$ gives a higher
level of privacy protection than one with a higher value of $\alpha$.
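
To make (6) and (7) concrete, the revealing probabilities can be computed by Bayes' theorem from the device probabilities (2)-(3) and then compared with the true proportions. This is only an illustrative sketch: the function names and the example proportions are assumptions introduced here.

```python
def revealing_probabilities(pi, p):
    """Return post[i][j] = Prob(X = x_i | R = x_j), obtained by Bayes' theorem."""
    m = len(pi)
    q = (1 - p) / m                                  # chance of reporting any fixed x_j by card
    lam = [p * pi_j + q for pi_j in pi]              # Prob(R = x_j)
    return [[(p * (i == j) + q) * pi[i] / lam[j] for j in range(m)]
            for i in range(m)]

def alpha(pi, p):
    """Largest gap between revealing and true probabilities, as in (6) and (7)."""
    post = revealing_probabilities(pi, p)
    m = len(pi)
    return max(abs(post[i][j] - pi[i]) for i in range(m) for j in range(m))

if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]                             # hypothetical proportions
    for p in (0.1, 0.4, 0.9):
        print(p, round(alpha(pi, p), 4))
```

For these hypothetical proportions, $\alpha$ grows with $p$: the more often the true value is reported, the more a response reveals about the respondent.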

Suppose the scientist planning a certain survey would like to keep the privacy protection
available to respondents above a certain threshold, i.e., would like to achieve $\alpha \le \xi$, where $\xi$ is
a pre-assigned quantity, $0 < \xi < 1$. Moreover, this bound on $\alpha$ should hold irrespective of the
unknown values of $\pi_1, \ldots, \pi_m$. The following theorem shows how the device parameter can be
chosen to achieve this.

Theorem 1. For $\alpha$ as in (7) and a preassigned $\xi$, where $0 < \xi < 1$, $\alpha \le \xi$ will hold, irrespective
of the values of $\pi_1, \ldots, \pi_m$, if and only if $p \le p_0$, where

$$p_0 = \frac{4\xi}{4\xi + m(1 - \xi)^2}. \tag{8}$$

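Proof. From (1)-(3), using Bayes' Theorem it follows that for $1 \le i, j \le m$,

$$\mathrm{Prob}(X = x_i \mid R = x_j) = \frac{\mathrm{Prob}(R = x_j \mid X = x_i)\, \pi_i}{\mathrm{Prob}(R = x_j)} = \frac{\left(p\delta_{ij} + \frac{1 - p}{m}\right) \pi_i}{p\pi_j + \frac{1 - p}{m}}, \tag{9}$$

where $\delta_{ij}$ is the Kronecker delta. Hence from (6) it follows that $\alpha_{ij} = \frac{p\pi_i |\delta_{ij} - \pi_j|}{p\pi_j + \frac{1 - p}{m}}$ and for any $i \ne j$,

$$\alpha_{ij} = \frac{p\pi_i \pi_j}{p\pi_j + \frac{1 - p}{m}} \le \frac{p\pi_j (1 - \pi_j)}{p\pi_j + \frac{1 - p}{m}} = \alpha_{jj}, \tag{10}$$

as $\pi_i + \pi_j \le 1$ for all $i \ne j$. Thus $\alpha = \max_{1 \le j \le m} \alpha_{jj} = \max_{1 \le j \le m} \frac{p\pi_j (1 - \pi_j)}{p\pi_j + \frac{1 - p}{m}}$. Hence, $\alpha \le \xi$ if and only if

$$\pi_j (1 - \pi_j) - \xi \pi_j \le \frac{\xi (1 - p)}{mp} \quad \text{for all } 1 \le j \le m. \tag{11}$$

First suppose $p \le p_0$. Then for $1 \le j \le m$,

$$\begin{aligned}
\pi_j (1 - \pi_j) - \xi \pi_j &= \left(\frac{1 - \xi}{2}\right)^2 - \left(\pi_j - \frac{1 - \xi}{2}\right)^2 \\
&\le \left(\frac{1 - \xi}{2}\right)^2 = \frac{\xi (1 - p_0)}{m p_0}, \quad \text{using (8)} \\
&\le \frac{\xi (1 - p)}{mp}.
\end{aligned}$$

Thus the inequalities in (11) hold, or equivalently $\alpha \le \xi$, irrespective of the values of $\pi_1, \ldots, \pi_m$.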

To prove the converse, suppose $\alpha \le \xi$, or equivalently, the inequalities in (11) hold, irrespective
of the values of $\pi_1, \ldots, \pi_m$. Then, for $\pi_1 = \frac{1 - \xi}{2}$, $\pi_2 = \frac{1 + \xi}{2}$, $\pi_3 = \ldots = \pi_m = 0$, in particular,
these inequalities will also hold. So, for this choice of $\pi_j$ values in (11) with $j = 1$, we have

$$\left(\frac{1 - \xi}{2}\right)^2 \le \frac{\xi (1 - p)}{mp}.$$

So from (8), $p \le p_0$. Hence the theorem. □

Remark 1. It is clear from (8) that in order to maintain the same level of protection, the
value of $p_0$ monotonically decreases with the number of possible values of $X$. Again, for a given
number of possible values of $X$, $p_0$ monotonically increases with $\xi$. We may reiterate that these
values of $p_0$ do not depend on how the values of $X$ are distributed in the population.

4.2 Not all values of X are stigmatizing

In many surveys it may so happen that not all values of $X$ are sensitive or stigmatizing. For
instance, in a survey for estimating the average number of criminal convictions of persons in a
certain population, the value $X = 0$ is not stigmatizing but any value of $X \ge 1$ could well be
stigmatizing. Similarly, for a survey for estimating the average of the number ($X$) of induced
abortions, the values $X = 0$ or $X = 1$ might not be considered as stigmatizing values while
other larger values might be considered stigmatizing by the respondents.

To study the respondents' privacy protection for such surveys, we present here the simpler
case where only one of the values of $X$, say $x_1$, is not stigmatizing, while values $x_2, \ldots, x_m$ are
considered stigmatizing. We develop the protection measure for this case in detail. Later we
remark that the results obtained for this case may be easily extended to the case where $X$ has
more than one non-stigmatizing value.

As before, the data collection and estimation proceed as in Sections 2 and 3. To study the
respondent protection we note that since the value $x_1$ is non-stigmatizing, respondents will feel
comfortable with a randomization device for which the 'revealing' probability of their having a
true value $x_1$ will be large. So, we propose the following measure of privacy:

$$\beta = \min_{1 \le j \le m} P(X = x_1 \mid R = x_j) = \min_{1 \le j \le m} \frac{\left(p\delta_{1j} + \frac{1 - p}{m}\right) \pi_1}{p\pi_j + \frac{1 - p}{m}}, \tag{12}$$

on simplification using (9). A device with a privacy protection value $\beta$ will guarantee that all
respondents are perceived to have $X = x_1$ with probability at least $\beta$. So, a device leading to a
larger value of $\beta$ will ensure greater privacy to respondents than one with a smaller $\beta$.

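Let $\xi$, $0 < \xi < 1$, denote a preassigned level of respondents' privacy. Then in order to achieve
this level of protection we require that $\beta \ge \xi$ irrespective of the values of $\pi_1, \ldots, \pi_m$. Thus we
should have

$$\left(p\delta_{1j} + \frac{1 - p}{m}\right) \pi_1 \ge \xi \left(p\pi_j + \frac{1 - p}{m}\right), \quad 1 \le j \le m,$$

or equivalently, the following inequalities should hold:

$$\left[p(1 - \xi) + \frac{1 - p}{m}\right] \pi_1 \ge \frac{\xi (1 - p)}{m} \tag{13}$$

$$\text{and} \quad \frac{1 - p}{m} \pi_1 - \xi p \pi_j \ge \frac{\xi (1 - p)}{m}, \quad 2 \le j \le m. \tag{14}$$

Clearly, no $p$ can satisfy (13) irrespective of $\pi_1, \ldots, \pi_m$ for any given $\xi$ since (13) fails as $\pi_1 \to 0$.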

So we assume that $\pi_1 > 0$ and we also assume some prior knowledge about a lower bound on
$\pi_1$. This assumption is quite realistic because in most populations there will be an appreciable
number of persons with a non-stigmatizing variable value and hence, a lower bound to the
proportion of such stigma-free persons in the population will be available.

Thus, suppose we have prior knowledge that $\pi_1 \ge c$. We work with $\xi < c$. This is again
realistic because if the only knowledge about $\pi_1$ is that $\pi_1 \ge c$, it is impractical to demand that
$P(X = x_1 \mid R = x_j) \ge \xi \ (> c)$ for all $j$. Now, the following theorem gives the value of the device
parameter $p$ which will guarantee the desired level of respondent protection $\xi$.

Theorem 2. Let $\beta$ be as in (12) and $\pi_1 \ge c$ for some known $c$. Then given a preassigned $\xi$,
where $0 < \xi < c$, $\beta \ge \xi$ will hold, irrespective of the values of $\pi_1, \ldots, \pi_m$, if and only if $p \le p_0$,
where

$$p_0 = \frac{\frac{c - \xi}{m}}{\frac{c - \xi}{m} + \xi (1 - c)}. \tag{15}$$

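Proof. Since $\pi_1 \ge c$, it is clear that $\pi_j \le 1 - c$ for $2 \le j \le m$ and we have

$$\left[p(1 - \xi) + \frac{1 - p}{m}\right] \pi_1 \ge \left[p(1 - \xi) + \frac{1 - p}{m}\right] c$$

$$\text{and} \quad \frac{1 - p}{m} \pi_1 - \xi p \pi_j \ge \frac{1 - p}{m} c - \xi p (1 - c), \quad 2 \le j \le m.$$

As a result, (13) and (14) will hold, irrespective of the true values of $\pi_1 (\ge c), \pi_2, \ldots, \pi_m$ iff

$$\left[p(1 - \xi) + \frac{1 - p}{m}\right] c \ge \frac{\xi (1 - p)}{m} \tag{16}$$

$$\text{and} \quad \frac{1 - p}{m} c - \xi p (1 - c) \ge \frac{\xi (1 - p)}{m} \tag{17}$$

hold. Now, (16) reduces to

$$\left(p + \frac{1 - p}{m}\right) c \ge \xi \left(cp + \frac{1 - p}{m}\right)$$

which will always hold for every $p$ since $\xi \left(cp + \frac{1 - p}{m}\right) \le \xi \left(p + \frac{1 - p}{m}\right) \le c \left(p + \frac{1 - p}{m}\right)$ as $\xi < c$ and
$p + \frac{1 - p}{m} > 0$. So, it is enough to only consider (17). Note that

$$(17) \iff \xi p (1 - c) \le \frac{1 - p}{m} (c - \xi) \iff p \le \frac{\frac{c - \xi}{m}}{\frac{c - \xi}{m} + \xi (1 - c)} = p_0,$$

thus proving the theorem. □

Remark. The above discussion can be extended to include the more general case where $X$
has $t$ non-stigmatizing values $x_1, \ldots, x_t$, say, while its remaining $m - t$ values are stigmatizing,
$1 \le t < m$. In that case too, it can be shown that $p_0$ takes the form as in Theorem 2, but now
with

$$\beta = \min_{1 \le j \le m} P(X = x_1 \text{ or } x_2 \text{ or } \ldots \text{ or } x_t \mid R = x_j) \quad \text{and} \quad \pi_1 + \ldots + \pi_t \ge c \ \text{ with } \xi < c.$$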

5 Privacy protection together with efficiency in estimation 

We now consider the issue of efficiency in estimation together with privacy protection in
randomized response surveys. It was seen from (5) that, irrespective of the values of $\pi_1, \ldots, \pi_m$,
the efficiency of estimation may be increased by increasing $p$. On the other hand, for a given $\xi$
and irrespective of the values of $\pi_1, \ldots, \pi_m$, Theorems 1 and 2 show that a protection of $\alpha \le \xi$
or $\beta \ge \xi$ may be guaranteed iff $p \le p_0$, where $p_0$ is as in (8) or (15), respectively. So, the
best choice of $p$ with regard to maximizing the efficiency of estimation of $\mu_X$, subject to the
stipulated level of privacy protection $\xi$, is $p = p_0$. The following examples illustrate this.

Example 5.1 Let $X$ take four values which are all sensitive. Suppose $\xi = 0.1$. Then by Theorem
1, $p_0 = 0.1099$. So, if we use a randomization device with $p = 0.1099$ then the efficiency of
estimation can be maximized while guaranteeing that the maximum discrepancy between the
true probability and the revealing probability of all respondents will be at most 0.1. □

The following table gives the $p_0$ values in (8) for some choices of $\xi$ and $m$.

$m$   $\xi$   $p_0$       $m$   $\xi$   $p_0$       $m$   $\xi$   $p_0$
 3     0.1    0.1413       4     0.1    0.1099       5     0.1    0.0899
 3     0.2    0.2941       4     0.2    0.2381       5     0.2    0.2000
 3     0.3    0.4494       4     0.3    0.3797       5     0.3    0.3288
 3     0.4    0.5970       4     0.4    0.5263       5     0.4    0.4706
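
The value in Example 5.1 and the tabulated values can be reproduced with a few lines of Python. The sketch below assumes the closed form $p_0 = 4\xi / (4\xi + m(1 - \xi)^2)$ for (8), which agrees with the quoted figures to four decimals; the function names and the grid search used as a sanity check are illustrative choices, not part of the article.

```python
def p0_all_stigmatizing(m, xi):
    """Largest device parameter with alpha <= xi for every population (form assumed for (8))."""
    return 4 * xi / (4 * xi + m * (1 - xi) ** 2)

def worst_case_alpha(m, p, grid=10_000):
    """Maximise p*t*(1-t) / (p*t + (1-p)/m) over a grid of proportions t in [0, 1].

    This is the largest value alpha can take over all populations for a given p.
    """
    q = (1 - p) / m
    return max(p * t * (1 - t) / (p * t + q)
               for t in (k / grid for k in range(grid + 1)))

if __name__ == "__main__":
    for xi in (0.1, 0.2, 0.3, 0.4):
        cells = [f"m={m}: {p0_all_stigmatizing(m, xi):.4f}" for m in (3, 4, 5)]
        print(f"xi={xi}", *cells)
    p0 = p0_all_stigmatizing(4, 0.1)
    print("worst-case alpha at p0:", round(worst_case_alpha(4, p0), 6))
```

The grid search is only a check: at $p = p_0$ the worst-case discrepancy should sit at the stipulated $\xi$, and any larger $p$ should exceed it for some population.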



Example 5.2 Let $X$ take one nonsensitive value and two sensitive values. Suppose it can be
assumed that at least 15% of the individuals in the population possess the nonsensitive value
and suppose it is stipulated that $\xi = 0.10$. Then by Theorem 2, $p_0 = 0.1639$. So, if we use
a device with $p = 0.1639$ then estimation efficiency will be maximum while guaranteeing that
all respondents will have at least a 10% probability of being revealed as belonging to the
non-stigmatizing class. □
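
Example 5.2 can be checked in the same way. The sketch below assumes the closed form $p_0 = (c - \xi) / \{(c - \xi) + m\xi(1 - c)\}$ for (15), an algebraically equivalent rewriting that reproduces the quoted 0.1639, and evaluates $\beta$ at a least favourable population in which $\pi_1 = c$ and all remaining mass sits on one stigmatizing value. All names and the chosen populations are hypothetical.

```python
def p0_one_nonstigmatizing(m, xi, c):
    """Largest device parameter with beta >= xi whenever pi_1 >= c (form assumed for (15)).

    Requires 0 < xi < c <= 1.
    """
    return (c - xi) / ((c - xi) + m * xi * (1 - c))

def beta(pi, p):
    """min over j of Prob(X = x_1 | R = x_j); pi[0] is the non-stigmatizing proportion."""
    m = len(pi)
    q = (1 - p) / m
    return min((p * (j == 0) + q) * pi[0] / (p * pi[j] + q) for j in range(m))

if __name__ == "__main__":
    m, xi, c = 3, 0.10, 0.15
    p0 = p0_one_nonstigmatizing(m, xi, c)
    print("p0 =", round(p0, 4))
    # Least favourable population under the constraint pi_1 >= c.
    print("beta at p0  :", round(beta([c, 1 - c, 0.0], p0), 6))
    # A larger device parameter breaks the guarantee for that population.
    print("beta at 0.3 :", round(beta([c, 1 - c, 0.0], 0.3), 6))
```

At $p = p_0$ the least favourable population attains $\beta = \xi$, while populations with a larger non-stigmatizing share are better protected.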



6 Estimation of population proportions 

As mentioned in Section 1, several researchers have estimated the proportions of individuals 
belonging to the two categories in dichotomous populations, while Loynes (1976) extended this 
to estimating the different proportions in a polychotomous population. In our case where $X$
takes $m$ numerical values, we may also readily estimate the population proportions $\pi_1, \ldots, \pi_m$
from the responses collected as in Section 2 and again use the measures of privacy as given in
(7) and (12) to achieve the stipulated level of privacy protection.
As seen in Section 3, an unbiased estimate of $\pi_i$ is

$$\hat{\pi}_i = \frac{1}{p}\left(w_i - \frac{1 - p}{m}\right), \quad 1 \le i \le m.$$

Suppose, in the spirit of A-optimality commonly used in optimal design theory, we would like
to minimize the average variance of these estimates. For this, we can show that

$$\sum_{i=1}^{m} \mathrm{Var}(\hat{\pi}_i) = \frac{1}{np^2} \sum_{i=1}^{m} \lambda_i (1 - \lambda_i) = \frac{1}{n} \left\{ \frac{1}{p^2} - \sum_{i=1}^{m} \pi_i^2 - \frac{1}{m}\left(\frac{1}{p^2} - 1\right) \right\}, \tag{18}$$

on simplification, using (4). Clearly, (18) is decreasing in $p$, irrespective of the true values of
$\pi_1, \ldots, \pi_m$. So as in the case of estimating the mean, here too, given some $\xi$, subject to the
constraint on protection of privacy, the best choice for $p$ for minimizing the average variance of
the estimates of the proportions, is $p = p_0$, with $p_0$ being given by (8) or (15), as the case may
be. The popular case of dichotomous populations follows by taking $m = 2$ in the above.
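
The total variance of the proportion estimates can also be checked numerically. The sketch below compares the direct sum $\frac{1}{np^2}\sum_i \lambda_i(1 - \lambda_i)$ with a simplified form assumed here for (18), $\frac{1}{n}\{\frac{1}{p^2} - \sum_i \pi_i^2 - \frac{1}{m}(\frac{1}{p^2} - 1)\}$; the function names and the example proportions are hypothetical.

```python
def total_variance_direct(pi, p, n):
    """Sum over i of Var(pi_hat_i) = lambda_i (1 - lambda_i) / (n p^2), lambda_i as in (4)."""
    m = len(pi)
    lam = [p * q + (1 - p) / m for q in pi]
    return sum(l * (1 - l) for l in lam) / (n * p ** 2)

def total_variance_closed_form(pi, p, n):
    """Simplified form assumed for (18)."""
    m = len(pi)
    inv_p2 = 1 / p ** 2
    return (inv_p2 - sum(q * q for q in pi) - (inv_p2 - 1) / m) / n

if __name__ == "__main__":
    pi, n = [0.5, 0.3, 0.2], 100                     # hypothetical population
    for p in (0.2, 0.4, 0.8):
        print(p, total_variance_direct(pi, p, n), total_variance_closed_form(pi, p, n))
```

For this example both expressions coincide and fall as $p$ increases, which is the behaviour the choice $p = p_0$ relies on.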